Authors: Dang Le and Vasu Dadhania
SOEN 6111 - Winter 2021
Presented to Dr. Tristan Glatard
In this notebook, we present the steps and results used to analyze the COVID-19 cases
dataset from The New York Times. This dataset is updated daily on GitHub and therefore
represents an accurate count of confirmed cases in US counties over time.
First we would like to introduce the dataset, which can be obtained from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
import plotly.graph_objects as go
import json
from urllib.request import urlopen
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
total_live_data = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
liveData = pd.read_csv(total_live_data, dtype={"fips": str})
liveData.head(5)
| | date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|---|
| 0 | 2020-01-21 | Snohomish | Washington | 53061 | 1 | 0.0 |
| 1 | 2020-01-22 | Snohomish | Washington | 53061 | 1 | 0.0 |
| 2 | 2020-01-23 | Snohomish | Washington | 53061 | 1 | 0.0 |
| 3 | 2020-01-24 | Cook | Illinois | 17031 | 1 | 0.0 |
| 4 | 2020-01-24 | Snohomish | Washington | 53061 | 1 | 0.0 |
print(f'Number of rows are {len(liveData)}')
print(f'Number of counties are {len(liveData.county.unique())}')
print(f'Number of distinct dates are {len(liveData.date.unique())}')
print(f'Number of distinct fips are {len(liveData.fips.unique())}')
Number of rows are 1196352
Number of counties are 1930
Number of distinct dates are 442
Number of distinct fips are 3219
As shown above, the data we are using has six columns: date, county, state, fips, cases, and deaths.
There are over a million rows of data, with 1,930 distinct county names reporting
COVID cases over a period of 442 days. Furthermore, there are over three
thousand FIPS codes, almost twice the number of distinct county names. This is because many
county names (Washington County, for example) occur in several states, while a FIPS code
uniquely identifies a single county.
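To illustrate why counting distinct county names undercounts counties, here is a small synthetic sketch (the rows below are made up for illustration; they are not the NYT data):

```python
import pandas as pd

# Illustrative rows: the county name "Washington" appears in two different
# states, but each (state, county) pair carries its own FIPS code.
df = pd.DataFrame({
    "county": ["Washington", "Washington", "Snohomish"],
    "state":  ["Vermont", "Oregon", "Washington"],
    "fips":   ["50023", "41067", "53061"],
})

print(df["county"].nunique())  # distinct county names: 2
print(df["fips"].nunique())    # distinct FIPS codes: 3
```

This is why the analysis below keys every county by its FIPS code rather than by its name.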
Because the given data is already sorted by date, we can see that Washington was the first
state to report a confirmed case, which is quite interesting. Furthermore, the data is very sparse in
the early part of the time interval, making clustering difficult because there
is not enough information. We therefore decided to transform the data into matrix form, where
each row represents a FIPS code, each column represents a date, and each value is the number of
confirmed cases for that county on that day.
To start this transformation, we first observe that we only need three columns, date, fips, and cases, because the state and county can be deduced from any given FIPS code. Also, the confirmed-death column contains many zeros, so we decided to drop it.
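The target matrix shape can be sketched with `pivot_table` on a tiny illustrative frame (the rows below are made up; in the notebook itself the matrix is built via a group-by, but the resulting shape is the same):

```python
import pandas as pd

# Illustrative long-format data: one row per (date, fips) observation.
covidData = pd.DataFrame({
    "date":  ["2020-01-21", "2020-01-22", "2020-01-21", "2020-01-22"],
    "fips":  ["53061", "53061", "17031", "17031"],
    "cases": [1, 1, 0, 1],
})

# Rows = FIPS codes, columns = dates, values = confirmed case counts.
matrix = covidData.pivot_table(index="fips", columns="date", values="cases")
print(matrix.shape)  # (2, 2): two counties over two days
```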
covidData = liveData[['date','fips','cases']]
groupByFips = covidData.set_index('date').groupby('fips')
groupByFips.cases.mean().describe()
count      3218.000000
mean       3316.863287
std       11715.329320
min           1.000000
25%         366.556524
50%         852.515182
75%        2174.934757
max      383966.876430
Name: cases, dtype: float64
After grouping the data by FIPS code, we calculate the mean of the reported case counts and call
the describe method for summary statistics. The count of 3,218 is one less than the 3,219 distinct
fips values found above because rows with a missing FIPS code are dropped by the group-by, confirming
that the grouping is otherwise correct.
Looking at the mean and standard deviation, we can see that the data is highly spread out (the standard
deviation is more than three times the mean), which indicates that clustering would be useful
in this application to identify and group similar data points. Furthermore, the distribution shows clear
anomalies: some groups have a mean of only 1 case, while others have over 300,000.
plt.hist(list(groupByFips.cases.mean()))
plt.rcParams['axes.facecolor'] = 'white'
plt.ylabel('Occurences')
plt.xlabel('Mean Daily covid cases')
plt.yscale('log', nonpositive='clip')
From the histogram, we can see that the majority of counties have a mean of fewer than 75,000 COVID-19 cases. However, a few regions have significantly more daily cases; we suspect these regions are highly populated, but we can only confirm this hypothesis after clustering. Also note that the y-axis uses a log scale for a cleaner representation.
In this section, we discuss our application of K-means and hierarchical clustering to
the dataset presented above.
Before proceeding with the clustering techniques, we need to identify how many clusters are
suitable for our dataset. To do so, we pick a sample of the data and run K-means clustering
on it with k ranging from 1 to 30 to find the best value.
resultgroupBy = groupByFips['cases'].apply(list).reset_index(name = 'new')
fips = list(groupByFips.groups.keys())
# Subsample the last 100 days, keeping every 5th day, to speed up the elbow test.
elbowData = resultgroupBy['new'].apply(lambda r: r[-100::5]).to_list()
distortions = []
max_k = 30
for i in range(1, max_k + 1):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(elbowData)
    distortions.append(km.inertia_)
plt.plot(range(1, max_k + 1), distortions, marker='o')
plt.xlabel('Cluster count (k)')
plt.ylabel('Distortion')
plt.show()
From the elbow graph above, the best number of clusters appears to lie between 6 and 8,
so we pick 7 as the number of clusters for both experiments.
We start our experiments with K-means clustering, since it is generally the faster of the two methods. Because some data is missing during the first couple of months of the pandemic, we limit our experiment to the last 100 days of data.
testData = resultgroupBy['new'].apply(lambda r: r[-100:]).to_list()
result = np.array(testData)
km = KMeans(n_clusters=7, init='k-means++', n_init=10, max_iter=300, random_state=0)
km.fit(result)
KmeansfinalDF = pd.DataFrame(data = fips)
KmeansfinalDF = KmeansfinalDF.rename(columns = {0:'fips'})
KmeansfinalDF['classification'] = km.labels_
silhouette = silhouette_score(testData, km.labels_, metric='euclidean')
print("silhouette_score "+str(silhouette))
davies_bouldin = davies_bouldin_score(testData, km.labels_)
print("davies_bouldin_score "+str(davies_bouldin))
silhouette_score 0.7651258320440756
davies_bouldin_score 0.4032532112579689
With the K-means clustering done, we compute the Silhouette and Davies-Bouldin scores, which quantify the quality of the clustering. It would also be useful to see the result plotted onto a US map.
print('Begin plotting')
fig = go.Figure(go.Choroplethmapbox(geojson=counties, locations=KmeansfinalDF['fips'], z=KmeansfinalDF['classification'],
colorscale="Portland", zmin=0, zmax=6,
marker_opacity=0.5, marker_line_width=0))
fig.update_layout(mapbox_style="carto-positron",
mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Begin plotting
Similar to the K-means experiment above, we then repeat the same experiment using hierarchical (agglomerative) clustering.
he = AgglomerativeClustering(n_clusters=7)
he.fit(result)
HEsfinalDF = pd.DataFrame(data = fips)
HEsfinalDF = HEsfinalDF.rename(columns = {0:'fips'})
HEsfinalDF['classification'] = he.labels_  # use the hierarchical labels, not km.labels_
silhouette = silhouette_score(testData, he.labels_, metric='euclidean')
print("silhouette_score "+str(silhouette))
davies_bouldin = davies_bouldin_score(testData, he.labels_)
print("davies_bouldin_score "+str(davies_bouldin))
silhouette_score 0.7468596668094296
davies_bouldin_score 0.432890384005536
We then plot the result onto a US map.
print('Begin plotting')
fig = go.Figure(go.Choroplethmapbox(geojson=counties, locations=HEsfinalDF['fips'], z=HEsfinalDF['classification'],
colorscale="Portland", zmin=0, zmax=6,
marker_opacity=0.5, marker_line_width=0))
fig.update_layout(mapbox_style="carto-positron",
mapbox_zoom=3, mapbox_center = {"lat": 37.0902, "lon": -95.7129})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Begin plotting
First, we compare the effectiveness of the two clustering techniques using the Silhouette and Davies-Bouldin scores.
According to the scikit-learn documentation, the Silhouette score is calculated from the mean intra-cluster distance (a) and
the mean nearest-cluster distance (b) for each sample. The best value is 1 and the worst is -1; values
near 0 indicate overlapping clusters.
Looking at our results above, K-means appears to be the better algorithm, with a Silhouette score of 0.7651
versus 0.7468 for hierarchical clustering.
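The per-sample score is s = (b - a) / max(a, b), averaged over all samples. A minimal hand computation on a toy two-cluster example (the points below are made up, purely to illustrate the formula):

```python
import numpy as np

# Two tiny, well-separated clusters on a line: {0, 1} and {10, 11}.
points = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        same = points[labels == labels[i]]
        other = points[labels != labels[i]]
        # a: mean distance to the other members of the same cluster
        a = np.abs(same[same != p] - p).mean()
        # b: mean distance to the nearest other cluster
        # (only two clusters here, so "nearest" is simply the other one)
        b = np.abs(other - p).mean()
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(round(silhouette(points, labels), 3))  # close to 1: well-separated clusters
```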
Next, we analyze the two results using the Davies-Bouldin index, which measures
the average similarity of each cluster with its most similar cluster,
where similarity is the ratio of within-cluster distances to between-cluster distances.
The minimum score is 0, with lower values indicating better clustering. The score we obtain for
K-means is 0.4032, while the score for hierarchical clustering is 0.4329. This again suggests that K-means
is the better algorithm of the two.
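The Davies-Bouldin index can likewise be computed by hand on a toy two-cluster example (the points below are made up, only to make the definition concrete):

```python
import numpy as np

# Same toy layout: two clusters on a line, {0, 1} and {10, 11}.
clusters = [np.array([0.0, 1.0]), np.array([10.0, 11.0])]

# s_i: average distance of each cluster's points to its own centroid.
centroids = [c.mean() for c in clusters]
spreads = [np.abs(c - m).mean() for c, m in zip(clusters, centroids)]

# R_ij = (s_i + s_j) / d(centroid_i, centroid_j); the index is the mean over
# clusters of each cluster's worst (largest) ratio. With two clusters there
# is a single ratio, shared by both.
d = abs(centroids[0] - centroids[1])
r = (spreads[0] + spreads[1]) / d
db = (r + r) / 2

print(round(db, 3))  # small value: tight, well-separated clusters
```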
Because the given dataset is not extremely large, comparing the run time of the two algorithms is not
meaningful here, so we leave this for future work with a dataset whose size is better suited to
performance comparison.
Next, we would like to point out some interesting observations from the results above.
First, the two sets of clusters look very similar, which suggests that even though K-means scores
better, hierarchical clustering still produces reliable results.
Next, we can see that highly populated areas such as Los Angeles, New York, and Miami are grouped into the same cluster, which supports our hypothesis that cities with high populations are likely to have similar COVID-19 progressions. However, to further validate this hypothesis and see which clusters can be classified as high risk versus low risk, we take one representative sample from each cluster and plot its progression over time.
sample = {}
for i in range(0, 7):
    # Take the first FIPS in each cluster as its representative sample.
    index = HEsfinalDF[HEsfinalDF['classification'] == i].iloc[0]['fips']
    sample[i] = resultgroupBy[resultgroupBy['fips'] == index].iloc[0]['new'][-100:]
x = list(range(0, 100))
for i in range(0, 7):
    plt.plot(x, sample[i], label = f"cluster {i}")
plt.xlabel('Days')
# Set the y axis label of the current axis.
plt.ylabel('COVID Cases')
plt.title('Covid Cases Over Time ')
# show a legend on the plot
plt.legend()
# Display a figure.
plt.show()
After plotting the case progression over time, we can see that cluster 1 has the highest number
of daily COVID cases, followed by clusters 4 and 2. Cluster 3 can be considered medium risk,
whereas clusters 0, 5, and 6 are relatively low risk.
Overall, this project has successfully demonstrated the versatility of clustering
techniques for analyzing the COVID-19 dataset provided by The New York Times. We are able to
identify regions of high, medium, and low risk, which could inform a vaccination plan in which
high-risk areas are prioritized, followed by medium- and low-risk regions.
Furthermore, as future work, it would be interesting to repeat the same experiment over
different time intervals to identify regions that move from one cluster to another. This would
highlight states/counties with good preventive measures in place and therefore allow stakeholders
to make an effective preventative plan.
We could also cluster regions by the number of confirmed deaths and see whether some regions
belong to a high-risk cluster in terms of daily cases but a lower-risk one in terms of deaths. This would
suggest they have an effective treatment plan for infected patients.